Turning Adversaries into Allies: Reversing Typographic Attacks for Multimodal E-Commerce Product Retrieval
Multimodal product retrieval systems in e-commerce platforms rely on effectively combining visual and textual signals to improve search relevance and user experience. However, vision-language models such as CLIP are vulnerable to typographic attacks, where misleading or irrelevant text embedded in images skews model predictions. In this work, we propose a novel method that reverses the logic of typographic attacks by rendering relevant textual content (e.g., titles, descriptions) directly onto product images to perform vision-text compression, thereby strengthening image-text alignment and boosting multimodal product retrieval performance. We evaluate our method on three vertical-specific e-commerce datasets (sneakers, handbags, and trading cards) using six state-of-the-art vision foundation models. Our experiments demonstrate consistent improvements in unimodal and multimodal retrieval accuracy across categories and model families. Our findings suggest that visually rendering product metadata is a simple yet effective enhancement for zero-shot multimodal retrieval in e-commerce applications.
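To make the proposed augmentation concrete, below is a minimal sketch of the core idea: render the product's metadata onto the image, then embed with CLIP as usual and compare similarities. The checkpoint name, banner layout, and example strings are illustrative assumptions, not the authors' exact pipeline.

```python
# A minimal sketch of the idea, assuming the Hugging Face
# "openai/clip-vit-base-patch32" checkpoint; the banner layout and
# example strings are illustrative, not the authors' pipeline.
import torch
from PIL import Image, ImageDraw
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def render_metadata(image: Image.Image, title: str) -> Image.Image:
    """Paste the product title into a white banner below the image."""
    banner_h = 40
    canvas = Image.new("RGB", (image.width, image.height + banner_h), "white")
    canvas.paste(image, (0, 0))
    draw = ImageDraw.Draw(canvas)
    draw.text((5, image.height + 10), title[:80], fill="black")
    return canvas

def similarity(image: Image.Image, query: str) -> float:
    """Cosine similarity between the (augmented) image and a text query."""
    inputs = processor(text=[query], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())

product = Image.open("sneaker.jpg").convert("RGB")
augmented = render_metadata(product, "Nike Air Max 97 'Silver Bullet' sneakers")
print(similarity(product, "silver running sneakers"),
      similarity(augmented, "silver running sneakers"))
```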
Towards Cross-Modal Error Detection with Tables and Images
Ovcharenko, Olga, Schelter, Sebastian
Ensuring data quality at scale remains a persistent challenge for large organizations. Despite recent advances, maintaining accurate and consistent data is still complex, especially when dealing with multiple data modalities. Traditional error detection and correction methods tend to focus on a single modality, typically a table, and often miss cross-modal errors that are common in domains like e-commerce and healthcare, where image, tabular, and text data co-exist. To address this gap, we take an initial step towards cross-modal error detection in tabular data, by benchmarking several methods. Our evaluation spans four datasets and five baseline approaches. Among them, Cleanlab, a label error detection framework, and DataScope, a data valuation method, perform the best when paired with a strong AutoML framework, achieving the highest F1 scores. Our findings indicate that current methods remain limited, particularly when applied to heavy-tailed real-world data, motivating further research in this area.
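As an illustration of the best-performing baseline, here is a minimal sketch of running Cleanlab over a purely tabular feature matrix. The random data is a stand-in; the paper's cross-modal features and AutoML pairing are not reproduced.

```python
# A minimal sketch of the Cleanlab baseline on tabular data; the random
# matrix stands in for real features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

X = np.random.rand(1000, 8)             # stand-in tabular features
y = np.random.randint(0, 3, size=1000)  # possibly noisy labels

# Out-of-sample predicted probabilities are required so that Cleanlab
# scores each row with a model that never saw its (possibly wrong) label.
pred_probs = cross_val_predict(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, cv=5, method="predict_proba",
)

issue_idx = find_label_issues(labels=y, pred_probs=pred_probs,
                              return_indices_ranked_by="self_confidence")
print(f"{len(issue_idx)} suspected label errors; worst first: {issue_idx[:10]}")
```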
Logo-VGR: Visual Grounded Reasoning for Open-world Logo Recognition
Liang, Zichen, Fei, Jingjing, Wang, Jie, Yang, Zheming, Li, Changqing, Wu, Pei, Qiu, Minghui, Yang, Fei, Liu, Xialei
Recent advances in multimodal large language models (MLLMs) have been primarily evaluated on general-purpose benchmarks, while their applications in domain-specific scenarios, such as intelligent product moderation, remain underexplored. To address this gap, we introduce an open-world logo recognition benchmark, a core challenge in product moderation. Unlike traditional logo recognition methods that rely on memorizing representations of tens of thousands of brands, an impractical approach in real-world settings, our proposed method, Logo-VGR, enables generalization to large-scale brand recognition with supervision from only a small subset of brands. Specifically, we reformulate logo recognition as a comparison-based task, requiring the model to match product images with candidate logos rather than directly generating brand labels. We further observe that existing models tend to overfit by memorizing brand distributions instead of learning robust multimodal reasoning, which results in poor performance on unseen brands. To overcome this limitation, Logo-VGR introduces a new paradigm of domain-specific multimodal reasoning: Logo Perception Grounding injects domain knowledge, and Logo-Guided Visual Grounded Reasoning enhances the model's reasoning capability. Experimental results show that Logo-VGR outperforms strong baselines by nearly 10 points in OOD settings, demonstrating superior generalization.
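The comparison-based reformulation can be sketched as prompt construction for an off-the-shelf MLLM: show the product photo alongside a few candidate logo crops and ask for the matching index. The OpenAI-style message schema and base64 helper below are assumptions, not the authors' interface.

```python
# A minimal sketch of the comparison-based task: the model picks a matching
# logo rather than generating a brand name. The message schema is an
# assumption, not the authors' exact interface.
import base64

def to_data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

def build_comparison_messages(product_path: str, candidate_paths: list[str]):
    content = [
        {"type": "text", "text": "Image 0 is a product photo. The remaining "
         "images are candidate brand logos. Answer with the index of the "
         "logo that appears on the product, or 'none'."},
        {"type": "image_url", "image_url": {"url": to_data_url(product_path)}},
    ]
    for i, path in enumerate(candidate_paths, start=1):
        content.append({"type": "text", "text": f"Candidate {i}:"})
        content.append({"type": "image_url",
                        "image_url": {"url": to_data_url(path)}})
    return [{"role": "user", "content": content}]

messages = build_comparison_messages("shoe.jpg", ["logo_a.png", "logo_b.png"])
```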
DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers
Wang, Lizhen, Xia, Zhurong, Hu, Tianshu, Wang, Pengrui, Wei, Pengfei, Zheng, Zerong, Zhou, Ming, Zhang, Yuan, Gao, Mingyuan
In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important for effective product presentation. However, most existing frameworks either fail to preserve the identities of both humans and products or lack an understanding of human-product spatial relationships, leading to unrealistic representations and unnatural interactions. To address these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our method simultaneously preserves human identities and product-specific details, such as logos and textures, by injecting paired human-product reference information and utilizing an additional masked cross-attention mechanism. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements. Additionally, structured text encoding is used to incorporate category-level semantics, enhancing 3D consistency during small rotational changes across frames. Trained on a hybrid dataset with extensive data augmentation strategies, our approach outperforms state-of-the-art techniques in maintaining the identity integrity of both humans and products and generating realistic demonstration motions. Project page: https://lizhenwangt.github.io/DreamActor-H1/.
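A masked cross-attention block of the kind described can be sketched in a few lines of PyTorch: video tokens attend only to reference tokens that fall inside the product region. The head count, dimensions, and hard masking scheme are illustrative assumptions, not the paper's exact DiT integration.

```python
# A minimal sketch of masked cross-attention: attention to reference tokens
# outside the product mask is zeroed out. Dimensions are illustrative.
import torch
import torch.nn as nn

class MaskedCrossAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, ref, ref_mask):
        # x: (B, N, D) video tokens; ref: (B, M, D) reference tokens;
        # ref_mask: (B, M) True where the token belongs to the product.
        # Assumes each row of ref_mask has at least one True entry.
        B, N, D = x.shape
        h, d = self.heads, D // self.heads
        q = self.to_q(x).view(B, N, h, d).transpose(1, 2)
        k, v = self.to_kv(ref).chunk(2, dim=-1)
        k = k.view(B, -1, h, d).transpose(1, 2)
        v = v.view(B, -1, h, d).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / d**0.5
        attn = attn.masked_fill(~ref_mask[:, None, None, :], float("-inf"))
        out = attn.softmax(dim=-1) @ v
        return self.proj(out.transpose(1, 2).reshape(B, N, D))

block = MaskedCrossAttention(dim=64)
x, ref = torch.randn(2, 16, 64), torch.randn(2, 10, 64)
mask = torch.rand(2, 10) > 0.3
print(block(x, ref, mask).shape)  # torch.Size([2, 16, 64])
```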
RefAdGen: High-Fidelity Advertising Image Generation
The rapid advancement of Artificial Intelligence Generated Content (AIGC) techniques has unlocked opportunities in generating diverse and compelling advertising images based on referenced product images and textual scene descriptions. This capability substantially reduces human labor and production costs in traditional marketing workflows. However, existing AIGC techniques either demand extensive fine-tuning for each referenced image to achieve high fidelity, or they struggle to maintain fidelity across diverse products, making them impractical for e-commerce and marketing industries. To tackle this limitation, we first construct AdProd-100K, a large-scale advertising image generation dataset. A key innovation in its construction is our dual data augmentation strategy, which fosters robust, 3D-aware representations crucial for realistic and high-fidelity image synthesis. Leveraging this dataset, we propose RefAdGen, a generation framework that achieves high fidelity through a decoupled design. The framework enforces precise spatial control by injecting a product mask at the U-Net input, and employs an efficient Attention Fusion Module (AFM) to integrate product features. This design effectively resolves the fidelity-efficiency dilemma present in existing methods. Extensive experiments demonstrate that RefAdGen achieves state-of-the-art performance, showcasing robust generalization by maintaining high fidelity and remarkable visual results for both unseen products and challenging real-world, in-the-wild images. This offers a scalable and cost-effective alternative to traditional workflows. Code and datasets are publicly available at https://github.com/Anonymous-Name-139/RefAdgen.
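The mask-injection idea can be sketched directly: downsample the binary product mask to latent resolution and concatenate it as an extra input channel, so the U-Net's first convolution sees latent_channels + 1 channels. Shapes below are illustrative assumptions.

```python
# A minimal sketch of mask injection at the U-Net input; the SD-style
# 4-channel latent and 512x512 mask are illustrative assumptions.
import torch
import torch.nn.functional as F

latents = torch.randn(1, 4, 64, 64)              # e.g. SD-style latent
product_mask = torch.rand(1, 1, 512, 512) > 0.5  # binary product mask

mask_lat = F.interpolate(product_mask.float(),
                         size=latents.shape[-2:], mode="nearest")
unet_in = torch.cat([latents, mask_lat], dim=1)  # (1, 5, 64, 64)
print(unet_in.shape)
```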
CTR-Driven Advertising Image Generation with Multimodal Large Language Models
Chen, Xingye, Feng, Wei, Du, Zhenbang, Wang, Weizhen, Chen, Yanyin, Wang, Haohan, Liu, Linkai, Li, Yaoyu, Zhao, Jinyuan, Li, Yu, Zhang, Zheng, Lv, Jingjing, Shen, Junjie, Lin, Zhangang, Shao, Jingping, Shao, Yuanjie, You, Xinge, Gao, Changxin, Sang, Nong
In web data, advertising images are crucial for capturing user attention and improving advertising effectiveness. Most existing methods for generating product backgrounds focus primarily on aesthetic quality and may fail to achieve satisfactory online performance. To address this limitation, we explore the use of Multimodal Large Language Models (MLLMs) for generating advertising images by optimizing for Click-Through Rate (CTR) as the primary objective. First, we build targeted pre-training tasks and leverage a large-scale e-commerce multimodal dataset to equip MLLMs with initial capabilities for advertising image generation. To further improve the CTR of generated images, we propose a novel reward model to fine-tune pre-trained MLLMs through Reinforcement Learning (RL); it jointly utilizes multimodal features and accurately reflects user click preferences. Meanwhile, a product-centric preference optimization strategy is developed to ensure that the generated background content aligns with the product characteristics after fine-tuning, enhancing the overall relevance and effectiveness of the advertising images. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both online and offline metrics. Our code and pre-trained models are publicly available at: https://github.com/Chenguoz/CAIG.
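One common way to train such a reward model, shown as a hedged sketch below, is a pairwise Bradley-Terry objective: of two candidate ad images for the same product, the one with more clicks should score higher. The feature dimension and two-layer head are assumptions; the paper's exact reward formulation may differ.

```python
# A minimal sketch of a pairwise (Bradley-Terry) objective for a CTR reward
# model; the 768-dim multimodal features and head architecture are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_head = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))

def pairwise_ctr_loss(feat_win: torch.Tensor,
                      feat_lose: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_win - r_lose), averaged over the batch."""
    r_win = reward_head(feat_win)
    r_lose = reward_head(feat_lose)
    return -F.logsigmoid(r_win - r_lose).mean()

# feat_win: features of the higher-CTR image, feat_lose: its counterpart.
loss = pairwise_ctr_loss(torch.randn(32, 768), torch.randn(32, 768))
loss.backward()
```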
Style4Rec: Enhancing Transformer-based E-commerce Recommendation Systems with Style and Shopping Cart Information
Ugurlu, Berke, Hong, Ming-Yi, Lin, Che
Understanding users' product preferences is essential to the efficacy of a recommendation system. Precision marketing leverages users' historical data to discern these preferences and recommends products that align with them. However, recent browsing and purchase records might better reflect current purchasing inclinations. Transformer-based recommendation systems have made strides in sequential recommendation tasks, but they often fall short in effectively utilizing product image style information and shopping cart data. In light of this, we propose Style4Rec, a transformer-based e-commerce recommendation system that harnesses style and shopping cart information to enhance existing transformer-based sequential product recommendation systems. We tested our model on an e-commerce dataset from our partnering company, where it exceeded established transformer-based sequential recommendation benchmarks across various evaluation metrics: HR@5 increased from 0.681 to 0.735, NDCG@5 from 0.594 to 0.674, and MRR@5 from 0.559 to 0.654. Thus, Style4Rec represents a significant step forward in personalized e-commerce recommendation systems.
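The style-fusion idea can be sketched as summing a product-ID embedding with a projection of pretrained image-style features before a transformer encoder. Dimensions, the summation fusion, and the style extractor below are illustrative assumptions, not Style4Rec's exact architecture.

```python
# A minimal sketch of fusing ID and image-style embeddings in a sequential
# recommender; dimensions and the summation fusion are assumptions.
import torch
import torch.nn as nn

class StyleSeqRec(nn.Module):
    def __init__(self, n_items: int, dim: int = 64, style_dim: int = 512):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, dim, padding_idx=0)
        self.style_proj = nn.Linear(style_dim, dim)  # projects style features
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, item_ids, style_feats):
        # item_ids: (B, L); style_feats: (B, L, style_dim) from a
        # pretrained image encoder applied to each product image.
        h = self.item_emb(item_ids) + self.style_proj(style_feats)
        h = self.encoder(h)
        return h[:, -1] @ self.item_emb.weight.T  # next-item scores

model = StyleSeqRec(n_items=1000)
scores = model(torch.randint(1, 1000, (8, 20)), torch.randn(8, 20, 512))
print(scores.shape)  # torch.Size([8, 1000])
```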
An Evaluation Framework for Product Images Background Inpainting based on Human Feedback and Product Consistency
Liang, Yuqi, Luo, Jun, Guo, Xiaoxi, Bi, Jianqi
In product advertising applications, automated background inpainting of product images using AI techniques has emerged as a significant task. However, these techniques still suffer from issues such as inappropriate backgrounds and inconsistent products in generated images, and existing approaches for evaluating the quality of generated product images are largely inconsistent with human feedback, so evaluation for this task still depends on manual annotation. To relieve these issues, this paper proposes Human Feedback and Product Consistency (HFPC), which automatically assesses generated product images based on two modules. First, to address inappropriate backgrounds, human feedback on 44,000 automatically inpainted product images is collected to train a reward model based on multimodal features extracted from BLIP and contrastive learning. Second, to filter generated images containing inconsistent products, a fine-tuned segmentation model segments the product in the original and generated images and compares the differences between the two. Extensive experiments demonstrate that HFPC can effectively evaluate the quality of generated product images and significantly reduce the expense of manual annotation. Moreover, HFPC achieves state-of-the-art precision (96.4%) compared with other open-source visual-quality-assessment models. Dataset and code are available at: https://github.com/created-Bi/background_inpainting_products_dataset
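The product-consistency module reduces to a mask-overlap test, sketched below: segment the product in both images and accept the pair only if the regions agree. The non-white-pixel `segment` stand-in and the IoU threshold are illustrative assumptions in place of the paper's fine-tuned segmentation model.

```python
# A minimal sketch of the consistency check via mask IoU; `segment` is a
# crude stand-in for the paper's fine-tuned segmentation model.
import numpy as np

def segment(image: np.ndarray) -> np.ndarray:
    """Stand-in segmenter: any non-white pixel counts as product, which
    only suits clean studio shots."""
    return (image < 245).any(axis=-1)

def product_consistent(original: np.ndarray, generated: np.ndarray,
                       iou_thresh: float = 0.9) -> bool:
    """Accept the pair when the product regions overlap enough."""
    m1, m2 = segment(original), segment(generated)
    union = np.logical_or(m1, m2).sum()
    if union == 0:
        return False
    return np.logical_and(m1, m2).sum() / union >= iou_thresh

rng = np.random.default_rng(0)
img = rng.integers(0, 255, size=(256, 256, 3), dtype=np.uint8)
print(product_consistent(img, img))  # identical images -> True
```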
TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models
Velioglu, Riza, Bevandic, Petra, Chan, Robin, Hammer, Barbara
This paper introduces Virtual Try-Off (VTOFF), a novel task focused on generating standardized garment images from single photos of clothed individuals. Unlike traditional Virtual Try-On (VTON), which digitally dresses models, VTOFF aims to extract a canonical garment image, posing unique challenges in capturing garment shape, texture, and intricate patterns. This well-defined target makes VTOFF particularly effective for evaluating reconstruction fidelity in generative models. We present TryOffDiff, a model that adapts Stable Diffusion with SigLIP-based visual conditioning to ensure high fidelity and detail retention. Experiments on a modified VITON-HD dataset show that our approach outperforms baseline methods based on pose transfer and virtual try-on with fewer pre- and post-processing steps. Our analysis reveals that traditional image generation metrics inadequately assess reconstruction quality, prompting us to rely on DISTS for more accurate evaluation. Our results highlight the potential of VTOFF to enhance product imagery in e-commerce applications, advance generative model evaluation, and inspire future work on high-fidelity reconstruction. Demo, code, and models are available at: https://rizavelioglu.github.io/tryoffdiff/
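Scoring reconstructions with DISTS, as the paper advocates, can be sketched with a third-party implementation; the snippet below assumes the `piq` package's DISTS module and RGB tensors scaled to [0, 1].

```python
# A minimal sketch of DISTS-based evaluation, assuming the third-party
# `piq` package; tensors are random stand-ins for reconstructed and
# ground-truth garment images.
import torch
import piq

dists = piq.DISTS(reduction="mean")  # lower is better

pred = torch.rand(4, 3, 256, 256)    # reconstructed garment images
target = torch.rand(4, 3, 256, 256)  # ground-truth catalog images
print(float(dists(pred, target)))
```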
AI Tailoring: Evaluating Influence of Image Features on Fashion Product Popularity
Identifying key product features that influence consumer preferences is essential in the fashion industry. In this study, we introduce a robust methodology to ascertain the most impactful features in fashion product images, utilizing past market sales data. First, we propose a metric called the "influence score" to quantitatively assess the importance of product features. Then we develop a forecasting model, the Fashion Demand Predictor (FDP), which integrates Transformer-based models and Random Forest to predict market popularity based on product images. We employ image-editing diffusion models to modify these images and perform an ablation study, which validates the impact of the highest- and lowest-scoring features on the model's popularity predictions. Additionally, we further validate these results through surveys that gather human rankings of preferences, confirming the accuracy of the FDP model's predictions and the efficacy of our method in identifying influential features. Notably, products enhanced with "good" features show marked improvements in predicted popularity over their modified counterparts. Our approach develops a fully automated and systematic framework for fashion image analysis that provides valuable guidance for downstream tasks such as fashion product design and marketing strategy development.
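The influence score can be sketched as an ablation difference: edit a feature out of the image, re-run the popularity predictor, and record the drop. `fdp` and `edit_feature` below are hypothetical stand-ins for the paper's FDP model and diffusion-based editor.

```python
# A minimal sketch of an ablation-style influence score; `fdp` and
# `edit_feature` are hypothetical stand-ins, not the paper's components.
from typing import Callable

def influence_score(image, feature: str,
                    fdp: Callable, edit_feature: Callable) -> float:
    """Predicted-popularity drop when `feature` is removed from the image."""
    edited = edit_feature(image, feature, mode="remove")
    return fdp(image) - fdp(edited)

def rank_features(image, features, fdp, edit_feature):
    """Rank features by how much removing them hurts predicted popularity."""
    scores = {f: influence_score(image, f, fdp, edit_feature)
              for f in features}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy demo: a "predictor" that favors images with more features, and an
# "editor" that just drops one from a feature set.
feats = {"puff sleeves", "floral print", "v-neck"}
fdp = lambda f: 0.5 + 0.1 * len(f)
edit_feature = lambda f, name, mode: f - {name}
print(rank_features(feats, list(feats), fdp, edit_feature))
```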